Abstract

Consider a clique of n nodes, where in each synchronous round each pair of nodes can 
exchange O(logn) bits. We provide deterministic constant-time solutions for two problems in 
this model. The first is a routing problem where each node is source and destination of n 
messages of size O(logn). The second is a sorting problem where each node is given n keys of 
size O(logn) and needs to learn their positions in the sorted sequence (either with or without 
duplicate keys) . The latter result also implies deterministic constant- round solutions for related 
problems such as selection or determining modes. 



1 Introduction & Related Work



Arguably, one of the most fundamental questions in distributed computing is what amount of 
communication is required to solve a given task. For systems where communication is dominating 
the "cost" — be it the time to communicate information, the money to purchase or rent the required 
infrastructure, or any other measure derived from a notion of communication complexity — exploring 
the imposed limitations may lead to more efficient solutions. 

Clearly, in such systems it does not make sense to make the complete input available to all 
nodes, as this would be too expensive; typically, the same is true for the output. For this reason, 
one assumes that each node is given a part of the input, and each node needs to compute a 
corresponding part of the output. For graph theoretic questions, the local input comprises the 
neighborhood of the node in the respective graph, potentially augmented by weights for its incident 
edges or similar information that is part of the problem specification. The local output then
consists, e.g., of an indication of membership in a set forming the global solution (a dominating set,
independent set, vertex cover, etc.), a value between 0 and 1 (for the fractional versions), a color,
etc. For verification problems, one is satisfied if for a valid solution all nodes output "yes" and at 
least one node outputs "no" for an invalid solution. 

Since the advent of distributed computing, a main research focus has been the locality of such 
computational problems. Obviously, one cannot compute, or even verify, a spanning tree in less than 
D synchronous communication rounds, where D is the diameter of the graph, as it is impossible 
to ensure that a subgraph is acyclic without knowing it completely. Formally, the respective lower 
bound argues that there are instances for which no node can reliably distinguish between a tree 
and a non-tree since only the local graph topology (and the parts of the prospective solution) up 
to distance R can affect the information available to a node after R rounds. More subtle such 
indistinguishability results apply to problems that can be solved in o(D) time (see e.g. [31 El [7]). 

This type of argument breaks down in systems where all nodes can communicate directly or
within a small number of rounds. However, this does not imply the existence of efficient solutions,
as due to limited bandwidth one usually has to be selective in what information to actually
communicate. This renders otherwise trivial tasks much harder, giving rise to strong lower bounds.
For instance, there are n-node graphs of constant diameter on which finding or verifying a spanning
tree and many related problems require $\tilde{\Omega}(\sqrt{n})$ rounds if messages contain a number of bits that
is polylogarithmic in n [10]; approximating the diameter up to factor $3/2 - \varepsilon$ or determining it
exactly cannot be done in $\tilde{o}(\sqrt{n})$ and $\tilde{o}(n)$ rounds, respectively [2]. These and similar lower bounds
consider specific graphs whose topology prohibits efficient communication. While the diameters
of these graphs are low, necessitating a certain connectivity, the edges ensuring this property are 
few. Hence, it is impossible to transmit a linear amount of bits between some nodes of the graph 
quickly, which forms the basis of the above impossibility results. 

This poses the question whether non-trivial lower bounds also hold in the case where the com- 
munication graph is well-connected. After all, there are many networks that do not feature small 
cuts, some due to natural expansion properties, others by design. Also, e.g. in overlay networks, 
the underlying network structure might be hidden entirely and algorithms may effectively operate 
in a fully connected system, albeit facing bandwidth limitations. Furthermore, while for scalability 
reasons full connectivity may not be applicable on a system-wide level, it could prove useful to 
connect multiple cliques that are not too large by a sparser high-level topology. 

These considerations motivate the study of distributed algorithms for a fully connected system of
n nodes subject to a bandwidth limitation of $O(\log n)$ bits per round and edge, which is the topic
of the present paper. Note that such a system is very powerful in terms of communication, as
each node can send and receive $O(n \log n)$ bits in each round, summing up to a total of $O(n^2 \log n)$
bits per round. Consequently, it is not too surprising that, to the best of our knowledge, so
far no negative results for this model have been published. On the positive side, a minimum 
spanning tree can be constructed in $O(\log \log n)$ rounds [6], and, given to each node the neighbors
of a corresponding node in some graph as input, it can be decided within $O(n^{1/3}/\log n)$ rounds
whether the input graph contains a triangle [1]. These bounds are deterministic; constant-round
randomized algorithms have been devised for the routing [5] and sorting [8] tasks that we solve 
deterministically in this work. 

Our Contribution. We show that the following closely related problems can be solved
deterministically within a constant number of communication rounds in a fully connected system where
messages are of size $O(\log n)$.

Routing: Each node is source and destination of (up to) n messages of size O(logn). Initially 
only the sources know destinations and contents of their messages. Each node needs to learn 
all messages it is the destination of. ( Section 3 ) 
Sorting: Each node is given (up to) n comparable keys of size O(logn). It needs to learn about 
the indices of its keys in a global enumeration of the keys that respects their order. We 
also consider the case where nodes need to learn the indices of their keys in the total order 
of the union of all keys (i.e., all duplicate keys get the same index). Note that this implies 
constant-round solutions for related problems like selection or determining modes. (Section 4) 
While these results are no lower bounds, they shed some light on why it is hard to provide impos- 
sibility results for this model: Even without randomization, the overhead required for coordinating 
the efforts of the nodes is constant. In particular, any potential lower bound for the considered 
model must, up to constant factors, also apply in a system where each node can in each round send 
and receive O(nlogn) bits to and from arbitrary nodes in the system, with no further constraints 
on communication. 

To complete the picture, in Section 5 we vary the parameters of bandwidth, message/key size, 
and number of messages/keys per node. Our techniques are sufficient to obtain asymptotically 
optimal results for almost the entire range of parameters. For keys of size o(logn), we show that 
in fact a huge number of keys can be sorted quickly; this is the special case for which our bounds 
might not be asymptotically tight. 



2 Model 

In brief, we assume a fully connected system of n nodes under the congestion model. The nodes have
unique identifiers 1 to n that are known to all other nodes. Computation proceeds in synchronous
rounds, where in each round, each node performs arbitrary, finite computations,¹ sends a message
to each other node, and receives the messages sent by other nodes. Messages are of size $O(\log n)$,
i.e., in each message nodes may encode a constant number of integer numbers that are polynomially
bounded in n.² To simplify the presentation, nodes will also treat themselves as receivers, i.e., node
$i \in \{1, \ldots, n\}$ will send messages to itself like to any other node $j \neq i$.

These model assumptions correspond to the congestion model on the complete graph
$K_n = (V, \binom{V}{2})$ on the node set $V = \{1, \ldots, n\}$ (cf. [9]). We stress that in a given round, a node may
send different messages along each of its edges and thus can convey a total of O(nlogn) bits of 
information. As our results will show, this makes the considered model much stronger than one 
where each node merely broadcasts the same O(logn) bits to all other nodes in each round. 

1 Our algorithms will perform polynomial computations with small exponent only.

2 We will not discuss this constraint when presenting our algorithms and only reason in a few places why messages 
are not too large; mostly, this should be obvious from the context. 
3 Routing



In this section, we derive a deterministic solution to the following task introduced in [3]. 

Problem 3.1 (Information Distribution Task). Each node $i \in V$ is given a set of n messages of
size $O(\log n)$

$$S_i = \{m_i^1, \ldots, m_i^n\}$$

with destinations $d(m_i^j) \in V$, $j \in \{1, \ldots, n\}$. Messages are globally lexicographically ordered by
their source i, their destination $d(m_i^j)$, and j. For simplicity, each such message explicitly contains
these values, in particular making them distinguishable. The goal is to deliver all messages to their
destinations, minimizing the total number of rounds. By

$$\mathcal{R}_k := \Bigl\{ m_i^j \in \bigcup_{i \in V} S_i \;:\; d(m_i^j) = k \Bigr\}$$

we denote the set of messages a node $k \in V$ shall receive. We require that also $|\mathcal{R}_k| = n$ for all
$k \in V$, i.e., the maximum number of messages a single node needs to receive is also n.



3.1 Basic Communication Primitives 

Let us first establish some basic communication patterns our algorithms will employ. We will utilize 
the following corollary of Hall's theorem given in [1]. For the sake of completeness, we reiterate the
straightforward proof as well. 

Corollary 3.2. Every d-regular bipartite multigraph is a disjoint union of d perfect matchings. 

Proof. By induction on d. For d = 1 the graph is a perfect matching by definition. 

Assume that the claim holds for some d, and let H = (L, R, E) be a (d + 1)-regular bipartite
multigraph. Let $W \subseteq L$ be some set of vertices, and define $\Gamma(W) := \{u \in R : \exists v \in W \text{ s.t. } (v,u) \in E\}$.
By regularity, the sum of degrees in W is exactly $(d + 1)|W|$. All of these edges end in $\Gamma(W)$, and
each vertex in $\Gamma(W)$ has degree $d + 1$, so by the pigeonhole principle
$|\Gamma(W)| \ge (d + 1)|W|/(d + 1) = |W|$. This satisfies Hall's marriage condition, implying
that a perfect matching exists. Removing the perfect matching found from the graph leaves a d-regular
bipartite multigraph that is a disjoint union of d perfect matchings by the induction hypothesis.
Adding those d perfect matchings to the one just obtained completes the proof. □



This enables us to solve Problem 3.1 efficiently provided that it is known a priori to all nodes what
the sources and destinations of messages are. This result has also been stated in [1]. We will need a
more general statement applying to subsets of nodes that want to communicate among themselves.
To this end, we first formulate a simple generalization of the result from [1] that assumes edges of
large capacity.

Lemma 3.3. Given a bulk of messages and $f \in \mathbb{N}$, such that:

1. The source and destination of each message is known in advance to all nodes, and each source
knows the contents of the messages to send.

2. Each node is the source of $m := fn$ messages.

3. Each node is the destination of m messages.

4. Each node can send up to f messages to each other node in each round.

A routing scheme to deliver all messages within 2 rounds can be found efficiently.
Proof. Consider the bipartite multigraph $G = (S \cup R, E)$ with $|S| = |R| = n$, where $S = \{1_s, \ldots, n_s\}$
and $R = \{1_r, \ldots, n_r\}$ represent the nodes in their roles as senders and receivers, respectively, and
each input message at some node i that is destined for some node j induces an edge from $i_s$ to $j_r$.

By Corollary 3.2, we can color the edge set of G with m colors such that no two edges with the
same color have a node in common. Moreover, as all nodes are aware of the source and destination
of each message, they can deterministically and locally compute the same such coloring, without
the need to communicate. Now, in the first communication round, each node sends its (unique)
message of color $c \in \{1, \ldots, m\}$ to node $c \bmod n$ (where the value 0 is identified with node n). As each
node holds exactly one message of each color, exactly f messages are sent over each edge, i.e., by the
assumptions of the lemma this step can indeed be performed in one round. Observe that this rule
ensures that each node will receive, from each sender, exactly one message of each of the f colors
assigned to it in the first round. Hence, because the coloring also
guarantees that each node is the destination of exactly one message of each color, it follows for each
$i, j \in \{1, \ldots, n\}$ that node i receives exactly f messages that need to be delivered to node j in the
first round. Therefore all messages can be delivered by directly sending them to their destinations
in the second round. □
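The two-round scheme from this proof can be replayed on a small instance. The following Python sketch is illustrative only and not part of the original text: the function names `perfect_matching`, `color_messages`, and `route_two_rounds` are hypothetical, nodes and colors are numbered from 0 instead of 1, and the coloring is obtained by repeatedly extracting a perfect matching with a plain augmenting-path search, which is just one simple way to realize Corollary 3.2. The function reports the largest number of messages crossing any single edge in round 1 and in round 2; by the lemma both values should equal f.

```python
from collections import defaultdict


def perfect_matching(adj, n):
    """Augmenting-path matching; adj[s] lists receivers r with a remaining edge s -> r."""
    match_r = [-1] * n  # match_r[r] = sender currently matched to receiver r

    def augment(s, seen):
        for r in adj[s]:
            if r not in seen:
                seen.add(r)
                if match_r[r] == -1 or augment(match_r[r], seen):
                    match_r[r] = s
                    return True
        return False

    for s in range(n):
        if not augment(s, set()):
            raise ValueError("input is not a regular bipartite multigraph")
    return match_r


def color_messages(msgs, n):
    """Give each message a color in 0..m-1 so that every color class is a perfect matching."""
    pool = defaultdict(list)  # (source, destination) -> indices of uncolored messages
    for idx, (s, r) in enumerate(msgs):
        pool[(s, r)].append(idx)
    m = len(msgs) // n
    color = [None] * len(msgs)
    for c in range(m):
        adj = [[r for r in range(n) if pool[(s, r)]] for s in range(n)]
        for r, s in enumerate(perfect_matching(adj, n)):
            color[pool[(s, r)].pop()] = c
    return color


def route_two_rounds(msgs, n, f):
    """Return (max load of an edge in round 1, max load of an edge in round 2)."""
    if len(msgs) != f * n * n:
        raise ValueError("each node must be source and destination of f*n messages")
    color = color_messages(msgs, n)
    load1 = defaultdict(int)  # (source, relay) -> messages sent over that edge in round 1
    load2 = defaultdict(int)  # (relay, destination) -> messages sent in round 2
    for idx, (s, r) in enumerate(msgs):
        relay = color[idx] % n  # a message of color c is first sent to node c mod n
        load1[(s, relay)] += 1
        load2[(relay, r)] += 1
    return max(load1.values()), max(load2.values())


# Small instance (an assumption for illustration): n = 4 nodes, f = 2,
# every node sends 4 messages to itself and 4 to its successor.
example = [(s, (s + k * k) % 4) for k in range(8) for s in range(4)]
loads = route_two_rounds(example, 4, 2)
```

Because all inputs to `color_messages` are known to every node under the lemma's assumptions, each node could run the same deterministic computation locally, which is the point the proof relies on.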

From this lemma, we can easily draw the conclusion that if we partition the node set, the 
respective subsets can communicate among themselves with a large bandwidth in a non-interfering 
way (granted that sources and destinations of messages are known a priori) . 

Corollary 3.4. We are given a subset $W \subseteq V$ and a bulk of messages such that the following
holds.

1. The source and destination of each message is in W.

2. The source and destination of each message is known in advance to all nodes in W, and each
source knows the contents of the messages to send.

3. Each node is the source of $f|W|$ messages, where $f := \lfloor n/|W| \rfloor$.

4. Each node is the destination of $f|W|$ messages.

Then a routing scheme to deliver all messages within 2 rounds can be found efficiently. The routing
scheme makes use of edges with at least one endpoint in W only.

Proof. W.l.o.g. we assume that n is an integer multiple of $|W|$, i.e., $n = f|W|$ (otherwise we just
ignore some of the nodes in $V \setminus W$).

We partition the nodes into disjoint subsets of size $|W|$. For each subset, we can define a one-to-one
mapping of the nodes in W to nodes in the subset, and there are exactly $n/|W|$ different
subsets. For each node $i \in W$ we thus can make use of $f = n/|W|$ nodes that will support i
in its duty as "relay". Note that this strategy will use only edges involving at least one node in
W and, because the subsets are disjoint and the mappings one-to-one, no edge is used more than
once in each direction in each of the two rounds. Moreover, in the first round each sender may
incorporate the information where to send the message in the second round, merely increasing
message size by $O(\log n)$ bits in doing so. Hence, we can logically identify each of the "relay" nodes
with its associated node in W, resulting in a fully connected system of $|W|$ nodes where each node
can transmit f messages over each edge in each round. With this observation, the claim of the
corollary directly follows from Lemma 3.3. □



An observation that will prove crucial for our further reasoning is that for subsets of size at
most $\sqrt{n}$, the amount of information that needs to be exchanged in order to establish common
knowledge on the sources and destinations of messages becomes sufficiently small to be handled.
Since this information itself consists, for each node, of $|W|$ numbers that need to be communicated
to $|W| \le n/|W|$ nodes (with sources and destinations known a priori!), we can solve the problem
for unknown sources and destinations by applying the previous corollary twice.
Corollary 3.5. We are given a subset $W \subseteq V$, where $|W| \le \sqrt{n}$, and a bulk of messages such that
the following holds.

1. The source and destination of each message is in W.

2. Each source knows the contents of the messages to send.

3. Each node is the source of $f|W|$ messages, where $f := \lfloor n/|W| \rfloor$.

4. Each node is the destination of $f|W|$ messages.

Then a routing scheme to deliver all messages within 4 rounds can be found efficiently. The routing
scheme makes use of edges with at least one endpoint in W only.

Proof. Each node in W announces the number of messages it holds for each node in W to all nodes
in W. This requires each node in W to send and receive $|W|^2 \le f|W|$ messages. As sources and
destinations of these helper messages are known in advance, by Corollary 3.4 we can perform this
preprocessing in 2 rounds. The information received establishes the preconditions of Corollary 3.4
for the original set of messages, therefore the nodes now can deliver all messages in another two
rounds. □

3.2 Solving the Information Distribution Task 



Equipped with the results from the previous section, we are ready to tackle Problem 3.1. In the
pseudocode of our algorithms, we will use a number of conventions to allow for a straightforward
presentation. When we state that a message is moved to another node, this means that the receiving
node will store a copy and serve as the source of the message in subsequent rounds of the algorithm,
whereas the original source may "forget" about the message. A step where messages are moved
is thus an actual routing step of the algorithm; all other steps serve to prepare the routing steps.
The current source of a message holds it. Moreover, we will partition the node set into subsets
of size $\sqrt{n}$, where for simplicity we assume that $\sqrt{n}$ is integer. We will discuss the general case
in the main theorem. We will frequently refer to these subsets, where W will invariably denote
any of the sets in its role as source, while W' will denote any of the sets in its role as receiver
(both with respect to the current step of the algorithm). Finally, we stress that statements about
moving and sending of messages in the pseudocode do not imply that the algorithm does so by
direct communication between sending and receiving nodes. Instead, we will discuss fast solutions
to the respective (much simpler) routing problems in our proofs establishing that the described
strategies can be implemented with small running times.



This being said, let us turn our attention to Problem 3.1. The high-level strategy of our solution
is given in Algorithm 1.



Algorithm 1: High-level strategy for solving Problem 3.1.

1. Partition the nodes into the disjoint subsets $\{(i - 1)\sqrt{n} + 1, \ldots, i\sqrt{n}\}$ for $i \in \{1, \ldots, \sqrt{n}\}$.

2. Move the messages such that each such subset W holds exactly $|W||W'| = n$ messages for
each subset W'.

3. For each pair of subsets W, W', move all messages destined to nodes in W' within W such
that each node in W holds exactly $|W'| = \sqrt{n}$ messages with destinations in W'.

4. For each pair of subsets W, W', move all messages destined to nodes in W' from W to W'.

5. For each W, move all messages within W to their destinations.



Obviously, following this strategy will deliver all messages to their destinations. In order to
prove that it can be deterministically executed in a constant number of rounds, we now show that
all individual steps can be performed in a constant number of rounds. Clearly, the first step requires
no communication. We leave aside Step 2 for now and turn to Step 3.



Corollary 3.6. Step 3 of Algorithm 1 can be implemented in 4 rounds. 



Proof. The proof is analogous to Corollary 3.5. First, each node in W announces to each other
node in W the number of messages it holds for each set W'. By Corollary 3.4, this step can be
completed in 2 rounds, for all sets W in parallel.

With this information, the nodes in W can deterministically compute (intermediate) destinations
for each message in W such that the resulting distribution of messages meets the condition
imposed by Step 3. Applying Corollary 3.4 once more, this redistribution can be performed in
another 2 rounds, again for all sets W concurrently. □



Trivially, Step 4 can be executed in a single round by each node in W sending exactly one
of the messages with destination in W' it holds to each node in W'. According to Corollary 3.5,
Step 5 can be performed in 4 rounds.



Regarding Step 2, we follow similar ideas. Algorithm 2 breaks our approach to this step down
into smaller pieces.



Algorithm 2: Step 2 of Algorithm 1 in more detail.

1. Each subset W computes, for each set W', the number of messages its constituents hold in
total for nodes in W'. The results are announced to all nodes.

2. All nodes locally compute a pattern according to which the messages are to be moved
between the sets. It satisfies that from each set W to each set W', n messages need to be
sent, and that in the resulting configuration, each subset W holds exactly $|W||W'| = n$
messages for each subset W'.

3. All nodes in subset W announce to all other nodes in W the number of messages they need to
move to each set W' according to the previous step.

4. All nodes in W compute a pattern for moving messages within W so that the resulting
distribution permits to realize the exchange computed in Step 2 in a single round (i.e., each
node in W must hold exactly $|W'| = \sqrt{n}$ messages with (intermediate) destinations in W').

5. The redistribution within the sets according to Step 4 is executed.

6. The redistribution among the sets computed in Step 2 is executed.



We now show that following the sequence given in Algorithm 2, Step 2 of Algorithm 1 requires
a constant number of communication rounds only.

Lemma 3.7. Step 2 of Algorithm 1 can be implemented in 7 rounds.



Proof. We will show for each of the six steps of Algorithm 2 that it can be performed in a constant 
number of rounds and that the information available to the nodes is sufficient to deterministically 
compute message exchange patterns the involved nodes agree upon. 



Clearly, Step 1 can be executed in two rounds. Each node in W simply sends the number of
messages with destinations in the i-th set W' it holds, where $i \in \{1, \ldots, \sqrt{n}\}$, to the i-th node in W.
The i-th node in W sums up the received values and announces the result to all nodes.

Regarding Step 2, consider the following bipartite multigraph $G = (S \cup R, E)$. The sets S and R are
of size $\sqrt{n}$ and represent the subsets W in their role as senders and receivers, respectively. For each
message held by a node in the i-th set W with destination in the j-th set W', we add an edge from
$i \in S$ to $j \in R$. Note that after Step 1 each node can locally construct this graph. As each node
needs to send and receive n messages, G is of uniform degree $n^{3/2}$. By Corollary 3.2, we can color
the edge set of G with $n^{3/2}$ colors so that no two edges of the same color share a node. We require
that a message of color $c \in \{1, \ldots, n^{3/2}\}$ is sent to the $(c \bmod \sqrt{n})$-th set. Hence, the requirement
that exactly n messages need to be sent from any set W to any set W' is met. By requiring that
each node uses the same deterministic algorithm to color the edge set of G, we make sure that the
exchange patterns computed by the nodes agree.

Note that a subtlety here is that nodes cannot yet determine the precise color of the messages
they hold, as they do not know the numbers of messages to sets W' held by other nodes in W and
therefore also not the index of their messages according to the global order of the messages. However,
they have sufficient knowledge to compute the number of messages they hold with destinations
in set W' by themselves, which is good enough to perform Step 3.



As observed before, Step 3 can be executed quickly: Each node in W needs to announce
$\sqrt{n}$ numbers to all other nodes in W, which by Corollary 3.4 can be done in 2 rounds. Now the
nodes are capable of computing the color of each of their messages according to the assignment
from Step 2.



With the information gathered in Step 3, it is now feasible to perform Step 4. This can be seen
by applying Corollary 3.2 again, for each set W to the bipartite multigraph $G = (W \cup R, E)$, where
R represents the $\sqrt{n}$ subsets W' in their receiving role with respect to the pattern computed in
Step 2 and each edge corresponds to a message held by a node in W with destination in some W'.
The nodes can locally compute this graph due to the information they received in Steps 2 and 3.
As G has degree n, we obtain an edge-coloring with n colors. Each node in W will move a message
of color $i \in \{1, \ldots, n\}$ to the $(i \bmod \sqrt{n})$-th node in W, implying that each node will receive for each
W' exactly $\sqrt{n}$ messages with destination in W'.

Since the exchange pattern computed in Step 4 is, for each W, known to all nodes in W, by
Corollary 3.4 we can perform Step 5 for all sets in parallel in 2 rounds. Finally, Step 6 requires a
single round only, since we achieved that each node holds for each W' exactly $\sqrt{n}$ messages with
destination in W' (according to the pattern computed in Step 2), and thus can send exactly one of
them to each of the nodes in W' directly.

Summing up the number of rounds required for each of the steps, we see that 2+0+2+0+2+1 = 
7 rounds are required in total, completing the proof. □ 



Overall, we have shown that each step of Algorithm 1 can be executed in a constant number
of rounds if $\sqrt{n}$ is integer. It is not hard to generalize this result to arbitrary values of n without
incurring larger running times.



Theorem 3.8. Problem 3.1 can be solved deterministically within 16 rounds. 



Proof. If $\sqrt{n}$ is integer, the result follows from Lemma 3.7, Corollary 3.6, and Corollary 3.5, taking
into account that the fourth step of the high-level strategy requires one round.
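For reference, the round count behind this statement can be written out as a sum over the five steps of Algorithm 1 (Step 1 needs no communication). This is only a restatement of the bounds already cited above, not an additional result:

```latex
\underbrace{0}_{\text{Step 1}}
+ \underbrace{7}_{\text{Step 2 (Lemma 3.7)}}
+ \underbrace{4}_{\text{Step 3 (Corollary 3.6)}}
+ \underbrace{1}_{\text{Step 4}}
+ \underbrace{4}_{\text{Step 5 (Corollary 3.5)}}
= 16 .
```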

If $\sqrt{n}$ is not integer, consider the following three sets of nodes: $V_1 := \{1, \ldots, \lfloor\sqrt{n}\rfloor^2\}$,
$V_2 := \{n - \lfloor\sqrt{n}\rfloor^2 + 1, \ldots, n\}$, and $V_3 := \{1, \ldots, n - \lfloor\sqrt{n}\rfloor^2\} \cup \{\lfloor\sqrt{n}\rfloor^2 + 1, \ldots, n\}$. $V_1$ and $V_2$ satisfy that
$|V_1| = |V_2| = \lfloor\sqrt{n}\rfloor^2$. Hence, we can apply the result for an integer root to the subsets of messages
for which either both sender and receiver are in $V_1$ or, symmetrically, in $V_2$. Doing so in parallel
will increase the message size by a factor of at most 2. Note that for messages where sender and
receiver are in $V_1 \cap V_2$ we can simply delete them from the input of one of the two instances of the
algorithm that run concurrently, and adding empty "dummy" messages, we see that it is irrelevant
that nodes may send or receive fewer than n messages in the individual instances.

Regarding $V_3$, denote for each node $i \in V_3$ by $S'_i \subseteq S_i$ the subset of messages for which i and
the respective receiver are neither both in $V_1$ nor both in $V_2$. In other words, for each message
in $S'_i$ either $i \in V_1 \cap V_3$ and the receiver is in $V_2 \cap V_3$ or vice versa. Each node $i \in V_3$ moves the
j-th message in $S'_i$ to node j (one round). No node will receive more than $|V_2 \cap V_3| = |V_1 \cap V_3|$
messages with destinations in $V_1 \cap V_3$, as there are no more than this number of nodes sending such
messages. Likewise, at most $|V_2 \cap V_3|$ messages for nodes in $V_2 \cap V_3$ are received. Hence, in the
subsequent round, all nodes can move the messages they received for nodes in $V_1 \cap V_3$ to nodes in
$V_1 \cap V_3$, and the ones received for nodes in $V_2 \cap V_3$ to nodes in $V_2 \cap V_3$ (one round). Finally, we apply
Corollary 3.5 to each of the two sets to see that the messages $\bigcup_{i \in V_3} S'_i$ can be delivered within 4
rounds. Overall, this procedure requires 6 rounds, and running it in parallel with the two instances
dealing with other messages will not increase message size beyond $O(\log n)$. The statement of the
theorem follows. □
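The case distinction in this proof can be made concrete with a short Python sketch. It is a hedged illustration only: the helper names `cover_sets` and `instance_of` are hypothetical, node identifiers run from 1 to n as in the model section, and the sketch merely classifies a message by which of the three concurrent procedures would handle it; it does not simulate the routing itself.

```python
from math import isqrt


def cover_sets(n):
    """Return (V1, V2, V3) as sets of node identifiers, following the definitions in the proof."""
    q = isqrt(n) ** 2  # floor(sqrt(n))^2
    v1 = set(range(1, q + 1))
    v2 = set(range(n - q + 1, n + 1))
    v3 = set(range(1, n - q + 1)) | set(range(q + 1, n + 1))
    return v1, v2, v3


def instance_of(sender, receiver, sets):
    """Name the procedure that handles a message from sender to receiver."""
    v1, v2, _ = sets
    if sender in v1 and receiver in v1:
        return "V1"  # integer-root algorithm on V1
    if sender in v2 and receiver in v2:
        return "V2"  # integer-root algorithm on V2
    return "V3"      # crossing message, handled by the 6-round procedure


sets_10 = cover_sets(10)  # n = 10: V1 = {1..9}, V2 = {2..10}, V3 = {1, 10}
```

For n = 10 the only crossing messages are those between nodes 1 and 10, which matches the description of the messages in $V_3$. The check was only exercised for n ≥ 4; very small n are not considered here.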



4 Sorting 

In this section, we present a deterministic algorithm for the sorting problem formulated in [8]. 

Problem 4.1 (Sorting). Each node is given n keys of size O(logn) (i.e., a key fits into a message). 
We assume w.l.o.g. that all keys are different.³ Each node needs to learn the indices of its keys in
the total order of all keys. 



4.1 Sorting Fewer Keys with Fewer Nodes 

Again, we assume for simplicity that $\sqrt{n}$ is integer and deal with the general case later on. Our
algorithm will utilize a subroutine that can sort up to $2n^{3/2}$ keys within a subset $W \subseteq V$ of $\sqrt{n}$
nodes, communicating along edges with at least one endpoint in the respective subset of nodes.
The latter condition ensures that we can run the routine in parallel for disjoint subsets W. We
assume that each of the nodes in W initially holds 2n keys. The pseudocode of our approach is
given in Algorithm 3.



Algorithm 3: Sorting $2n^{3/2}$ keys with $|W| = \sqrt{n}$ nodes. Each node in W has 2n input keys
and learns their indices in the total order of all $2n^{3/2}$ keys.

1. Each node in W locally sorts its keys and selects every $(2\sqrt{n})$-th key according to this order
(i.e., a key of local index i is selected if $i \bmod 2\sqrt{n} = 0$).

2. Each node in W announces the selected keys to all other nodes in W.

3. Each node in W locally sorts the union of the received keys and selects every $\sqrt{n}$-th key
according to this order. We call such a key a delimiter.

4. Each node $i \in W$ splits its original input into $\sqrt{n}$ subsets, where the j-th subset $K_{i,j}$ contains
all keys that are larger than the (j − 1)-th delimiter (for j = 1 this condition does not apply)
and smaller than or equal to the j-th delimiter.

5. Each node $i \in W$ announces $|K_{i,j}|$ for each j to all nodes in W.

6. Each node $i \in W$ sends $K_{i,j}$ to the j-th node in W.

7. Each node in W locally sorts the received keys. For each received key, the index in this local
order is sent back to the node whose input contains the key.

8. The nodes locally compute their input keys' indices in the total order of the input keys in W.
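The eight steps above can be traced with a centralized Python simulation. This is a sketch under stated assumptions, not the distributed implementation: the function name `sort_within_subset` is hypothetical, all communication is replaced by direct access to shared lists, keys are assumed distinct, and indices are 0-based. Besides the computed indices, it returns the largest number of keys any single node receives in Step 6.

```python
import random
from bisect import bisect_left


def sort_within_subset(inputs):
    """inputs[i] holds the 2n distinct keys of the i-th node of W, where |W| = sqrt(n).

    Returns (index, max_bucket): index[i][key] is the 0-based global rank of key,
    max_bucket is the largest number of keys received by one node in Step 6.
    """
    k = len(inputs)  # |W| = sqrt(n)
    step = 2 * k     # every (2*sqrt(n))-th key is selected
    # Steps 1-2: sort locally, select keys, make the selection known within W.
    local = [sorted(keys) for keys in inputs]
    selected = sorted(key for keys in local for key in keys[step - 1::step])
    # Step 3: every sqrt(n)-th selected key becomes a delimiter.
    delimiters = selected[k - 1::k]
    # Step 4: K_{i,j} contains the keys of node i in (delimiter j-1, delimiter j].
    buckets = [[[] for _ in range(k)] for _ in range(k)]
    for i, keys in enumerate(local):
        for key in keys:
            buckets[i][bisect_left(delimiters, key)].append(key)
    # Step 5: the sizes |K_{i,j}| become known to all nodes in W.
    sizes = [[len(buckets[i][j]) for j in range(k)] for i in range(k)]
    # Step 6: node j receives K_{i,j} from every node i and sorts the union.
    received = [sorted(key for i in range(k) for key in buckets[i][j]) for j in range(k)]
    # Steps 7-8: rank inside the bucket plus the total size of all smaller buckets.
    index = [dict() for _ in range(k)]
    for j in range(k):
        offset = sum(sizes[i][jj] for i in range(k) for jj in range(j))
        rank = {key: r for r, key in enumerate(received[j])}
        for i in range(k):
            for key in buckets[i][j]:
                index[i][key] = offset + rank[key]
    return index, max(len(bucket) for bucket in received)


# Example with n = 9, i.e. |W| = 3 nodes holding 2n = 18 keys each (values chosen arbitrarily).
rng = random.Random(0)
keys = rng.sample(range(10_000), 2 * 9 * 3)
inputs = [keys[i::3] for i in range(3)]
index, max_bucket = sort_within_subset(inputs)
```

The largest selected key is also the largest key overall, because the last key of every locally sorted list is selected; this is why every key falls into one of the $\sqrt{n}$ buckets in the sketch.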



Let us start out with the correctness of the proposed scheme. 

3 Otherwise we order the keys lexicographically by key, node whose input contains the key, and a local enumeration 
of identical keys at each node. 
Lemma 4.2. When executing Algorithm 3, the nodes in W are indeed capable of computing their
input keys' indices in the order on the union of the input keys of the nodes in W.

Proof. Observe that because all nodes use the same input in Step 3, they compute the same set of
delimiters. The set of all keys is the union $\bigcup_{j=1}^{\sqrt{n}} \bigcup_{i \in W} K_{i,j}$, and the sets $K_{i,j}$ are disjoint. As the
$K_{i,j}$ are defined by comparison with the delimiters, we know that all keys in $K_{i,j}$ are larger than
keys in $K_{i',j'}$ for all $i' \in W$ and $j' < j$, and smaller than keys in $K_{i',j'}$ for all $i' \in W$ and $j' > j$.
Hence, the global index of a key $k \in K_{i,j}$ equals

$$\Bigl| \bigcup_{j'=1}^{j-1} \bigcup_{i' \in W} K_{i',j'} \Bigr| = \sum_{j'=1}^{j-1} \sum_{i' \in W} |K_{i',j'}|$$

plus the index of k in the (induced) order on $\bigcup_{i' \in W} K_{i',j}$. In Step 6, the j-th node in W learns
exactly about $\bigcup_{i' \in W} K_{i',j}$. Hence it will return the latter index to $i \in W$ in Step 7. Because of
Step 5, i can compute $\sum_{j'=1}^{j-1} \sum_{i' \in W} |K_{i',j'}|$, and by adding these two values it can determine the
index of k in the global order in Step 8, as claimed. □



Before turning to the running time of the algorithm, we show that the partitioning of the keys 
by the delimiters is well-balanced. 

Lemma 4.3. When executing Algorithm 3, for each $j \in \{1, \ldots, \sqrt{n}\}$ it holds that

$$\Bigl| \bigcup_{i \in W} K_{i,j} \Bigr| < 4n.$$



Proof. Due to the choice of the delimiters, $\bigcup_{i \in W} K_{i,j}$ contains exactly $\sqrt{n}$ of the keys selected in Step 1 of the algorithm. Denote by $d_i$ the number of such selected keys in $K_{i,j}$. As in Step 1 each node selects every $(2\sqrt{n})$-th of its keys and the set $K_{i,j}$ is a contiguous subset of the ordered sequence of input keys at node $i$, we have that $|K_{i,j}| < 2\sqrt{n}(d_i + 1)$. It follows that

$$\left| \bigcup_{i \in W} K_{i,j} \right| = \sum_{i \in W} |K_{i,j}| < 2\sqrt{n} \sum_{i \in W} (d_i + 1) = 2\sqrt{n}\,(\sqrt{n} + |W|) = 4n.$$

□
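
The balance guarantee can be checked on a concrete instance with the following sketch. It assumes, as described in the proof, that each of the $\sqrt{n}$ nodes holds $2n$ distinct keys, that Step 1 selects every $(2\sqrt{n})$-th key, and that every $\sqrt{n}$-th selected key becomes a delimiter. The name `bucket_sizes`, the parameter `root` (standing for $\sqrt{n}$), and the random test input are illustrative assumptions.

```python
import random
from bisect import bisect_left


def bucket_sizes(inputs, root):
    """Return the sizes of the unions of the K_{i,j} over i, one per delimiter.

    inputs: root lists, each holding 2 * root**2 distinct keys (n = root**2).
    """
    step = 2 * root
    selected = []
    for keys in inputs:
        assert len(keys) == 2 * root * root
        # Step 1: keep the keys whose 1-based local rank is a multiple of 2*sqrt(n).
        selected.extend(sorted(keys)[step - 1::step])
    # Every sqrt(n)-th selected key becomes a delimiter (sqrt(n) delimiters in total).
    selected.sort()
    delimiters = selected[root - 1::root]
    sizes = [0] * len(delimiters)
    for keys in inputs:
        for key in keys:
            # The key falls into bucket j iff delimiters[j-1] < key <= delimiters[j].
            sizes[bisect_left(delimiters, key)] += 1
    return sizes


rng = random.Random(0)
root = 4  # n = 16, so each of the 4 nodes holds 2n = 32 keys
pool = rng.sample(range(10**6), 2 * root**3)
example = [pool[v * 2 * root**2:(v + 1) * 2 * root**2] for v in range(root)]
sizes = bucket_sizes(example, root)
print(max(sizes), "<", 4 * root**2)
```

Since the largest key of each node is among the selected keys, the last delimiter is the overall maximum, so every key falls into some bucket.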



We are now in the position to complete our analysis of the subroutine. 

Lemma 4.4. Given a subset $W \subseteq V$ of size $\sqrt{n}$ such that each $w \in W$ holds $2n$ keys, each node in $W$ can learn about the indices of its keys in the total order of all keys held by nodes in $W$ within 10 rounds. Furthermore, only edges with at least one endpoint in $W$ are used for this purpose.



Proof. By Lemma 4.2, Algorithm 3 is correct. Hence, it remains to show that it can be implemented with 10 rounds of communication, using no edges with both endpoints outside $W$.

Steps 1, 3, 4, and 8 involve local computations only. Since $|W| = \sqrt{n}$ and each node selects exactly $\sqrt{n}$ keys it needs to announce to all other nodes, according to Corollary 3.4 Step 2 can be performed in 2 rounds. The same holds true for Step 5, as again each node needs to announce $|W| = \sqrt{n}$ values to each other node in $W$. In Step 6, each node sends its $2n$ input keys and, by Lemma 4.3, receives at most $4n$ keys. By bundling a constant number of keys in each message, nodes need to send and receive at most $n = |W| \cdot n/|W|$ messages. Hence, Corollary 3.5 states that this step can be completed in 4 rounds. The same holds true for Step 7; in fact, however, we merely need to reverse the message paths used for Step 6, as sources and receivers simply switch roles, so that 2 rounds suffice. In total, we thus require $0 + 2 + 0 + 0 + 2 + 4 + 2 = 10$ communication rounds.

As we invoked Corollaries 3.4 and 3.5 in order to define the communication pattern, it immediately follows from the corollaries that all communication is on edges with at least one endpoint in $W$. □

4.2 Sorting All Keys

With this subroutine at hand, we can move on to Problem 4.1. Our solution follows the same pattern as Algorithm 3, where the subroutine in combination with Theorem 3.8 enables that sets of size $\sqrt{n}$ can take over the function nodes had in Algorithm 3. This increases the processing power by factor $\sqrt{n}$, which is sufficient to deal with all $n^2$ keys. Algorithm 4 shows the high-level structure of our solution.



Algorithm 4: Solving Problem 4.1

1. Each node locally sorts its input and selects every $\sqrt{n}$-th key (i.e., the index in the local order modulo $\sqrt{n}$ equals 0).

2. Each node transmits its $i$-th selected key to node $i$.

3. Using Algorithm 3, nodes $1, \ldots, \sqrt{n}$ sort the in total $n^{3/2}$ keys they received (i.e., determine the respective indices in the induced order).

4. Out of the sorted subsequence, every $n$-th key is selected as delimiter and announced to all nodes (i.e., there is a total of $\sqrt{n}$ delimiters).

5. Each node $i \in V$ splits its original input into $\sqrt{n}$ subsets, where the $j$-th subset $K_{i,j}$ contains all keys that are larger than the $(j-1)$-th delimiter (for $j = 1$ this condition does not apply) and smaller or equal to the $j$-th delimiter.

6. The nodes are partitioned into $\sqrt{n}$ disjoint sets $W$ of size $\sqrt{n}$. Each node $i \in V$ sends $K_{i,j}$ to the $j$-th set $W$ (i.e., each node in $W$ receives either $\lfloor |K_{i,j}|/|W| \rfloor$ or $\lceil |K_{i,j}|/|W| \rceil$ keys, and each key is sent to exactly one node).

7. Using Algorithm 3, the sets $W$ sort the received keys and send the resulting indices to the sending node from the previous step.

8. The nodes locally compute their input keys' indices in the global order of the keys.
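
The data flow of the eight steps can be traced with a sequential sketch. It deliberately ignores all communication: the calls to Algorithm 3 are replaced by Python's built-in sort, the sets $W$ are collapsed into one bucket each, keys are assumed to be distinct, $n$ is assumed to be a perfect square, and indices are 0-based. The function name `sort_all_keys` is an assumption made for illustration.

```python
import math
from bisect import bisect_left


def sort_all_keys(inputs):
    """Sequential sketch of the structure of Algorithm 4 (no communication modeled).

    inputs: n lists of n distinct keys each, with n a perfect square.
    Returns indices[i][p], the 0-based global index of the key inputs[i][p].
    """
    n = len(inputs)
    root = math.isqrt(n)
    assert root * root == n and all(len(keys) == n for keys in inputs)
    # Step 1: every sqrt(n)-th key of the local order is selected.
    selected = []
    for keys in inputs:
        selected.extend(sorted(keys)[root - 1::root])
    # Steps 2 and 3: the n^(3/2) selected keys are sorted (stand-in for Algorithm 3).
    selected.sort()
    # Step 4: every n-th key of the sorted selection is a delimiter.
    delimiters = selected[n - 1::n]
    # Steps 5 and 6: keys are split by the delimiters and sent to the j-th set W.
    buckets = [[] for _ in delimiters]
    for i, keys in enumerate(inputs):
        for pos, key in enumerate(keys):
            buckets[bisect_left(delimiters, key)].append((key, i, pos))
    # Steps 7 and 8: each bucket is sorted; index = size of earlier buckets + rank.
    indices = [[None] * n for _ in range(n)]
    offset = 0
    for bucket in buckets:
        bucket.sort()
        for rank, (_, i, pos) in enumerate(bucket):
            indices[i][pos] = offset + rank
        offset += len(bucket)
    return indices
```

With the keys 1 to 16 spread over four nodes, for instance `[[13, 2, 7, 9], [1, 16, 5, 3], [8, 4, 15, 11], [6, 12, 10, 14]]`, each key `x` should receive the index `x - 1`.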



The techniques and results from the previous sections are sufficient to derive our second main 
theorem without further delay. 



Theorem 4.5. Problem 4.1 can be solved in 47 rounds.



Proof. We discuss the special case of $\sqrt{n} \in \mathbb{N}$ first, to which we can apply Algorithm 4. Correctness of the algorithm follows analogously to Lemma 4.2. Steps 1, 5, and 8 require local computations only. Step 2 involves one round of communication. Step 3 calls Algorithm 3, which by Lemma 4.4 consumes 10 rounds. Step 4 can be executed in 2 rounds, since there are $\sqrt{n}$ nodes each of which needs to announce at most $\sqrt{n}$ values to all nodes. Regarding Step 6, observe that, analogously to Lemma 4.3, we have for each $j \in \{1, \ldots, \sqrt{n}\}$ that

$$\left| \bigcup_{i \in V} K_{i,j} \right| < \sqrt{n}\,(n + |V|) = 2n^{3/2}.$$

Hence, each node needs to send at most $n$ keys and receive at most $2n$ keys. Bundling up to two keys in each message, nodes need to send and receive at most $n$ messages. Therefore, by Theorem 3.8, Step 6 can be completed within 16 rounds. Step 7 again calls Algorithm 3, this time in parallel for all sets $W$. Nonetheless, by Lemma 4.4 this requires 10 rounds only because the edges used for communication are disjoint. Moreover, nodes need to communicate the resulting indices to the nodes having these keys as input. By Theorem 3.8, this can be done in 16 rounds; however, as we merely need to reverse the message paths used in Step 6, we can save the 8 rounds that are not used to actually move messages. Thus, Step 7 can be completed within 18 rounds. Overall, the algorithm runs for $0 + 1 + 10 + 2 + 0 + 16 + 18 + 0 = 47$ rounds.

With respect to non-integer values of $\sqrt{n}$, observe that we can increase message size by any constant factor to accommodate more keys in each message. This way we can work with subsets of size $\lfloor \sqrt{n} \rfloor$ and similarly select keys and delimiters in Steps 1 and 4 such that the adapted algorithm can be completed in 47 rounds as well. □
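
The round budget can be re-added mechanically as a sanity check. The per-step values below are the ones stated in the proofs of Lemma 4.4 and Theorem 4.5; the dictionary layout and the constant names are assumptions made for illustration.

```python
# Rounds per step of Algorithm 3, as argued in the proof of Lemma 4.4.
ALGORITHM_3 = {1: 0, 2: 2, 3: 0, 4: 0, 5: 2, 6: 4, 7: 2, 8: 0}
subroutine = sum(ALGORITHM_3.values())

ROUTING = 16      # rounds for one routing instance (Theorem 3.8)
REUSED_PATHS = 8  # rounds saved when the message paths of Step 6 are reversed

# Rounds per step of Algorithm 4, as argued in the proof of Theorem 4.5.
ALGORITHM_4 = {
    1: 0,
    2: 1,
    3: subroutine,
    4: 2,
    5: 0,
    6: ROUTING,
    7: subroutine + (ROUTING - REUSED_PATHS),
    8: 0,
}
total = sum(ALGORITHM_4.values())
print(subroutine, total)
```

Step 7 dominates the budget because it combines a call to the subroutine with returning the indices along the reversed paths.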

We conclude this section with a corollary stating that the slightly modified task of determining 
each input key's position in a global enumeration of the different keys that are present in the system 
can also be solved efficiently. 



Corollary 4.6. Consider the variant of Problem 4.1 where each node is required to determine the
index of its input keys in the total order of the union of all input keys. This task can be solved 
deterministically in a constant number of rounds. 

Proof. We apply our regular sorting algorithm with a minor modification. To this end, we make the keys distinguishable and run the algorithm, but pause it before returning the indices of the keys to the nodes having them as input, i.e., before the sending operation of Step 7 of Algorithm 4.

Next we select a single copy of each key. Note that the nodes that locally sort the keys can simply mark exactly the first copy of any key k they have; the only problem here is that for the smallest key k of the subsequence a node sorts, it does not have the information whether the node sorting the next smaller subsequence has a copy of k as well (implying that only this node should mark a copy). This can easily be solved by an additional round of communication where the nodes announce the largest key in the subsequence they sort.

We now call the regular sorting algorithm, however, only for the selected keys. Here, we consider the nodes that sort the subsequences in the first instance as origins of the keys. After learning the indices the keys have according to this call, they can now return the respective values to the nodes that have these keys as input in the original problem (using the message paths the first instance determined for Step 7). Similarly to before, we may encounter the problem that a sorting node did not learn about the index of the smallest key k of its subsequence because it was not selected. Again, this can be solved by another round of communication where each such node announces the index of the largest key in its subsequence.



Overall, we called Algorithm 4 two times, which by Theorem 4.5 requires a constant number of 
rounds, and performed two additional rounds of communication. As clearly the suggested scheme 
returns the correct values, this completes the proof. □ 
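
The marking of first copies can be sketched sequentially as follows. The sketch assumes that the globally sorted sequence is already cut into consecutive pieces, one per sorting node, and it replaces the second call of the sorting algorithm by a running count of marked keys. The function name `distinct_key_indices` and the 0-based indices are illustrative assumptions.

```python
def distinct_key_indices(chunks):
    """Sketch: index of every key among the distinct keys (0-based).

    chunks: the globally sorted key sequence, cut into consecutive pieces,
    one piece per sorting node (pieces may be empty).
    """
    # Extra round: every sorting node announces the largest key of its piece.
    largest = [chunk[-1] if chunk else None for chunk in chunks]
    marks = []
    previous = None  # largest key held by the nodes sorting smaller pieces
    for chunk, top in zip(chunks, largest):
        row = []
        last = previous
        for key in chunk:
            # A key is marked iff it is the first copy in the whole sequence.
            row.append(key != last)
            last = key
        marks.append(row)
        if top is not None:
            previous = top
    # Stand-in for sorting only the marked keys: count them in order.
    result = []
    count = 0
    for chunk, row in zip(chunks, marks):
        out = []
        for _, marked in zip(chunk, row):
            if marked:
                count += 1
            out.append(count - 1)
        result.append(out)
    return result
```

For the pieces `[[1, 1, 2], [2, 2, 5], [], [5, 7]]`, the second node does not mark its copies of 2 because its predecessor announced 2 as its largest key.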



5 Varying Message and Key Size 



In this section, we discuss scenarios where the number and size of messages and keys for Problems 3.1 and 4.1 vary. This also motivates reconsidering the bound on the number of bits that nodes can exchange in each round: for message/key size of $O(\log n)$, communicating $B \in O(\log n)$ bits over each edge in each round was shown to be sufficient, and for smaller $B$ the number of rounds clearly must increase accordingly.$^4$ We will see that most ranges for these parameters can be handled asymptotically optimally by the presented techniques. For the remaining cases, we will give solutions in this section.



4 Formally proving a lower bound is trivial in both cases, as nodes need to communicate their n messages to deliver 
all messages or their n keys to enable determining the correct indices of all keys, respectively. 

5.1 Large Messages or Keys

If messages or keys contain $\omega(\log n)$ bits and $B$ is not sufficiently large to communicate a single value in one message, splitting these values into multiple messages is a viable option. For instance, with bandwidth $B \in \Theta(\log n)$, a key of size $\Theta(\log^2 n)$ would be split into $\Theta(\log n)$ separate messages, permitting the receiver to reconstruct the key from the individual messages. This simple argument shows that in fact not the total number of messages (or keys) is decisive for the more general versions of Problems 3.1 and 4.1, but the number of bits that need to be sent and received by each node. If this number is in $\Omega(n \log n)$, the presented techniques are asymptotically optimal.
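
Splitting and reassembling a large value can be sketched as follows. The sketch assumes that keys are non-negative integers of a known bit length and that the receiver obtains the pieces in a known order (for instance one piece per round over the same edge). The function names `split_key` and `join_key` are hypothetical.

```python
def split_key(key, key_bits, bandwidth):
    """Split a key of key_bits bits into pieces of bandwidth bits each.

    The most significant piece comes first; the number of pieces is
    ceil(key_bits / bandwidth).
    """
    assert 0 <= key < (1 << key_bits) and bandwidth > 0
    count = -(-key_bits // bandwidth)  # ceiling division
    mask = (1 << bandwidth) - 1
    return [(key >> (bandwidth * (count - 1 - t))) & mask for t in range(count)]


def join_key(pieces, bandwidth):
    """Reassemble a key from its pieces (most significant piece first)."""
    key = 0
    for piece in pieces:
        key = (key << bandwidth) | piece
    return key
```

A 12-bit key sent with 4 bits per message is thus split into three messages, in line with the observation that the number of bits per node, not the number of keys, is the decisive quantity.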
5.2 Small Messages 



If we assume that in Problem 3.1 the size of messages is bounded by $M \in o(\log n)$, we may hope that we can solve the problem in a constant number of rounds even if we merely transmit $B \in O(M)$ bits along each edge. With the additional assumption that nodes can identify the sender of a message even if the identifier is not included, this can be achieved if sources and destinations of messages are known in advance: We apply Lemma 3.3 with $m = n$ and observe that because the communication pattern is known to all nodes, knowing the sender of a message is sufficient to perform the communication and infer the original source of each message at the destination.

On the other hand, if sources/destinations are unknown, consider inputs where $\Omega(n^2)$ messages cannot be sent directly from their sources to their destinations (i.e., using the respective source-receiver edge) within a constant number of rounds. Each of these messages needs to be forwarded in a way preserving their destination, i.e., at least one of the forwarding nodes must learn about the destination of the message (otherwise correct delivery cannot be guaranteed). Explicitly encoding these values for $\Omega(n^2)$ messages requires $\Omega(n^2 \log n)$ bits. Implicit encoding can be done by means of the round number or relations between the communication partners' identifiers. However, encoding bits by introducing constraints reduces (at least for worst-case inputs) the number of messages that can be sent by a node accordingly. These considerations show that in case of Problem 3.1, small messages do not simplify the task.

5.3 Small Keys 



The situation is different for Problem 4.1. Note that we need to drop the assumption that all keys can be distinguished, as this would necessitate key size $\Omega(\log n)$. In contrast, if keys can be encoded with $o(\log n)$ bits, there are merely $n^{o(1)}$ different keys. Hence, we can statically assign disjoint sets of $\log^2 n$ nodes to each key $k$ (for simplicity we assume that $\log n$ is integer). In the first round, each node binary encodes the number of copies it holds of $k$ and sends the $i$-th bit to $\log n$ of these nodes. The $j$-th of the $\log n$ receiving nodes of bit $i$ counts the number of nodes which sent it a 1, encodes this number binary, and transmits the $j$-th bit to all nodes. With this information, all nodes are capable of computing the total number of copies of $k$ in the system.
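
The two-round counting scheme for a single key can be simulated as follows. The sketch assumes that `bits` binary digits suffice both for the per-node counts and for the number of nodes, which sidesteps the rounding of $\log n$. The function name `count_copies` and the list-based node model are assumptions made for illustration.

```python
def count_copies(copies, bits):
    """Simulate the two-round counting scheme for one fixed key.

    copies[v] is the number of copies held by node v. Every entry of copies
    and len(copies) itself are assumed to be smaller than 2**bits.
    Returns the total that every node reconstructs from the broadcast bits.
    """
    assert len(copies) < (1 << bits) and all(0 <= c < (1 << bits) for c in copies)
    # Round 1: the helpers for bit i learn how many nodes have bit i set.
    ones = [sum((c >> i) & 1 for c in copies) for i in range(bits)]
    # Round 2: the j-th helper for bit i broadcasts bit j of that number.
    broadcast = [[(ones[i] >> j) & 1 for j in range(bits)] for i in range(bits)]
    # Local computation: rebuild each per-bit count and weight it by 2**i.
    total = 0
    for i in range(bits):
        ones_i = sum(bit << j for j, bit in enumerate(broadcast[i]))
        total += ones_i << i
    return total
```

Each helper transmits a single bit per receiver, so the scheme uses `bits * bits` helper nodes for one key, matching the $\log^2 n$ nodes assigned per key above.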

In order to assign an order to the different copies of $k$ in the system (if desired), in the second round we can require that in addition the $j$-th node dealing with bit $i$ sends to node $v \in \{1, \ldots, n\}$ the $j$-th bit of an encoding of the number of nodes $v' \in \{1, \ldots, v-1\}$ that sent a 1 in the first round. This way, node $v$ can also compute the number of copies of $k$ held by nodes $v' < v$, which is sufficient to order the keys as intended.

It is noteworthy that this technique can actually be used to order a much larger total number of keys, since we "used" very few of the nodes. If we have $K < n/\log^2 n$ different keys, we can assign $m := \lfloor n/K \rfloor$ nodes to each key. This permits handling any binary encoding of up to $\lfloor \sqrt{m} \rfloor$ many bits in the above manner, potentially allowing for huge numbers of keys. More generally, each node can be concurrently responsible for $B$ bits, improving the power of the approach further for non-constant $B$. At the same time, messages contain merely 2 bits (or a single bit, if we accept 3 rounds of communication).